This material is based on chapter 3 of the Modern Statistics for Modern Biology book by Susan Holmes and Wolfgang Huber.
We want to visualise data to
Our learning objectives are
ggplot2;ggplot2;plotly, one package for ploducing interactive
visualisations.The default graphics system that comes with R, often called base R graphics is simple and fast. It is based on the painter’s model or canvas, where different output are directly overlaid on top of each other.
Below, we display the relation between the optical density of the deoxyribonuclease (DNase) protein as measure by an enzyme-linked immunosorbent assay (ELISA) assay for all observations.
head(DNase)
## Run conc density
## 1 1 0.04882812 0.017
## 2 1 0.04882812 0.018
## 3 1 0.19531250 0.121
## 4 1 0.19531250 0.124
## 5 1 0.39062500 0.206
## 6 1 0.39062500 0.215
plot(DNase$con, DNase$density)
Figure 1: The default base plot function on the DNase data
We can add some features on the plot, such vertical dotted lines for
all observed observations and customise the look and feel of the plot
by setting specific arguments to the plot function.
plot(DNase$con, DNase$density,
xlab = "DNase concentration (ng/ml)",
ylab = "Optical density",
pch = 1,
col = "steelblue")
abline(v = unique(DNase$conc), lty = "dotted")
Figure 2: Customising a base figure using function arguments and overlaying new graphical features
If we wanted to change anything to that figures, we would need to repeat all the commands and modify accordingly. Any additinal command would be added to the existing canvas.
Exercise: How would you produce a figure that differentiates the different runs using base graphics?
Exercise: use the hist and boxplot functions to produce and
histogram of all optical densities and a boxplot of the densities
split by run.
The base graphics function are very effective to quickly produce out of the box figures. However, there is not global overview and parametrisation of the visualisation. The layout decisions have to be made up upfront (and if not adequate, the figure needs to be redrawn) and every aspect of the figure is customised locally as function arguments.
More generally, base graphics functions will work with various inputs:
above we have worked with a data.frame, vectors and a formula. There
is no unified type of data across all functions which makes it
efficient for some types of data (if they match), but also very
heterogeneous in terms of interface, leading to a lot of customisation
code.
Finally, defaults, and colours in particular, are poorly chosen.
ggplot2 packageggplot2 is a plotting package that makes it simple to create
complex plots from data in a data frame. It provides a more
programmatic interface for specifying what variables to plot, how they
are displayed, and general visual properties. The theoretical
foundation that supports the ggplot2 is the Grammar of
Graphics1 Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics
and Computing). Berlin, Heidelberg: Springer-Verlag.. Instead of producing the figure, the user defines and
assembles the visual components into an object that is the
displayed. There is a book about ggplot22 Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data
Analysis. Springer-Verlag New York. that provides a good
overview, but it is outdated. The ggplot2 web page
(https://ggplot2.tidyverse.org) provides ample documentation.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
We need first to load the ggplot2 package:
library("ggplot2")
ggplot() function and bind the plot to a specific data frame using the
data argumentggplot(data = DNase)
Figure 3: We have only specified the data, and there’s nothing to display yet
aes) function), by
selecting the variables to be plotted and specifying how to present
them in the graph, e.g. as x/y positions or characteristics such as
size, shape, colour, etc.ggplot(data = DNase,
mapping = aes(x = conc, y = density))
Figure 4: ggplot2 can now generate the axes, ticks and ranges based on the data
+
operator. Because we have two continuous variables, let’s use
geom_point() first:ggplot(data = DNase,
mapping = aes(x = conc, y = density)) +
geom_point()
Figure 5: Final figures with rendering of the data as a scatter plot
Exercise: compare the ggplot2 and base graphics version of the
density vs. concentration plot. Which one do you prefer, and why?
It is possible to store the output of the ggplot function into a
variable that can be visualised by either typing its name in the
console or explicitly printing it (like any other variable).
gg <- ggplot(data = DNase,
mapping = aes(x = conc, y = density)) +
geom_point()
print(gg)
Figure 6: Saving and printing an object
Let’s immediately customise this visualisation to
- highlight how to re-use the gg object without repeating the
plotting code and
- how we can add additional (identical or different) geoms to a
plot.
gg + geom_point(aes(colour = Run))
Figure 7: Adding another geom_point with its own (local) aesthetics
Exercise: What do you think of the colours used to differentiate the different runs above?
Below is an example of more elaborated composition, overlaying points
and a non-linear loess regression. But first, let’s load a object
containing microarray data from the hiiragi2013.rda
file. The
data is originally available from the
Hiiragi2013
package and Cell-to-cell expression variability followed by signal
reinforcement progressively segregates early mouse lineages paper by
Y. Ohnishi et al. Nature Cell Biology (2014) 16(1):
27-37. doi:10.1038/ncb2881.
Below, we combine the (transposed) expression values (extracted from
the hiiragi2013 object with exprs) and sample metadata (ext-raced
from the hiiragi2013 object with pData)
load("../Raw_Data/hiiragi2013.rda")
library("Biobase")
dftx <- data.frame(t(Biobase::exprs(hiiragi2013)), pData(hiiragi2013))
# or dftx<- as.data.frame(hiiragi2013)
dftx[1:10, 1:3]
## X1415670_at X1415671_at X1415672_at
## 1 E3.25 4.910459 9.768979 10.411893
## 2 E3.25 7.526672 9.144228 10.918942
## 3 E3.25 6.956328 9.295010 9.495738
## 4 E3.25 6.424048 11.059831 10.317996
## 5 E3.25 5.105808 9.376749 11.143684
## 6 E3.25 5.856685 9.681017 10.234943
## 7 E3.25 5.059961 7.665886 10.642970
## 8 E3.25 4.574661 9.325743 9.304958
## 9 E3.25 8.123073 9.724729 11.037632
## 10 E3.25 5.464257 9.389818 9.754123
dftx[1:10, 45105:45109]
## lineage genotype ScanDate sampleGroup sampleColour
## 1 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 2 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 3 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 4 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 5 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 6 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 7 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 8 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 9 E3.25 WT 2011-03-16 E3.25 #CAB2D6
## 10 E3.25 WT 2011-03-16 E3.25 #CAB2D6
ggplot(dftx, aes(x = X1426642_at, y = X1418765_at)) +
geom_point(shape = 1) +
geom_smooth(method = "loess")
Figure 8: Modelling the relation between the expression of X1426642_at and X1418765_at
And, adding colours representing the different samples
ggplot(dftx, aes(x = X1426642_at, y = X1418765_at)) +
geom_point(aes(color = sampleColour), shape = 19) +
geom_smooth(method = "loess")
Figure 9: Modelling the relation between the expression of X1426642_at and X1418765_at and annotating samples
View(pData(hiiragi2013))
# Biobase::exprs(hiiragi2013)
Let’s start by exploring some 1 dimensional visualisation. This is very relevant for omics data such as transcriptomics or quantitative proteomics, when contrasting the expression values across multiple samples.
First, we convert a microarray gene expression data to a data.frame,
fit for some ggplot2 visualisation, focusing on genes Fgf4, Gata4,
Gata6 and Sox2.
selectedProbes <- c(Fgf4 = "1420085_at", Gata4 = "1418863_at",
Gata6 = "1425463_at", Sox2 = "1416967_at")
library("dplyr")
library("tidyr")
tmp <- data.frame(t(exprs(hiiragi2013[selectedProbes, ])))
names(tmp) <- names(selectedProbes)
tmp$sample <- rownames(tmp)
head(tmp)
## Fgf4 Gata4 Gata6 Sox2 sample
## 1 E3.25 3.027715 4.843137 5.500618 1.731217 1 E3.25
## 2 E3.25 9.293016 5.530016 6.160900 9.697038 2 E3.25
## 3 E3.25 2.940142 4.418059 4.584961 4.161240 3 E3.25
## 4 E3.25 9.715243 5.982314 4.753439 9.540123 4 E3.25
## 5 E3.25 8.924228 4.923580 4.629728 8.705340 5 E3.25
## 6 E3.25 11.325952 4.068520 4.165692 8.696228 6 E3.25
genes <- gather(tmp, key = "gene", value = "expression", -sample)
head(genes)
## sample gene expression
## 1 1 E3.25 Fgf4 3.027715
## 2 2 E3.25 Fgf4 9.293016
## 3 3 E3.25 Fgf4 2.940142
## 4 4 E3.25 Fgf4 9.715243
## 5 5 E3.25 Fgf4 8.924228
## 6 6 E3.25 Fgf4 11.325952
genes %>%
filter(gene == "Gata4") %>%
ggplot(aes(x = expression)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 10: Distribution of the Gata4 expression
p <- ggplot(genes, aes(x = gene, y = expression, fill = gene))
bxplot <- p + geom_boxplot()
bxplot
Figure 11: A boxplot of expression values
Exercise: Repeat the above figure replacing the boxes by violins
using the geom_violin. Which one do you think does a better job?
When the data aren’t too large, it is also possibly to visualise all points to get a sense of their distribution.
jtrplot <- p +
geom_jitter(aes(colour = gene)) +
theme(legend.position = "none")
In a dotplot, the position of the points along the y axis is discretised into bins (set as 1/6 below) and the points are then stacked next to each other.
dotplot <- p + geom_dotplot(binaxis = "y", binwidth = 1/6,
stackdir = "center", stackratio = 0.75,
aes(color = gene)) +
theme(legend.position = "none")
The beeswarm algorithms tries to avoid overlapping points.
library("ggbeeswarm")
beeplot <- p + geom_beeswarm(aes(color = gene)) +
theme(legend.position = "none")
library(patchwork)
jtrplot + dotplot + beeplot
Figure 12: Showing all expression values using jittering (left), a dotplot (centre) and a beeswarn plot
densplot <- ggplot(genes, aes(x = expression, color = gene)) +
geom_density() +
theme(legend.position = "none")
ecdfplot <- ggplot(genes, aes(x = expression, color = gene)) +
stat_ecdf() +
theme(legend.position = "none")
densplot + ecdfplot
Figure 13: Density and cumulative density functions of expression values
Boxplot makes sense for unimodal distributions (see below).
Histogram requires definition of bins (width, positions) and can create visual artefacts especially if the number of data points is not large.
Density requires the choice of bandwidth; obscures the sample size (i.e. the uncertainty of the estimate).
ecdf does not have these problems; but is more abstract and interpretation requires more training. Good for reading off quantiles and shifts in location in comparative plots.
beeswarm: for up to a few dozens of points, just show the data.
The number of modes of a distribution depends on scale transformation of the data.
sim <- data.frame(x = exp(rnorm(n = 1e5,
mean = sample(c(2, 5),
size = 1e5,
replace = TRUE))))
p1 <- ggplot(sim, aes(x)) +
geom_histogram(binwidth = 10, boundary = 0) +
xlim(0, 400)
p2 <- ggplot(sim, aes(log(x))) +
geom_histogram(bins = 30)
p1 + p2
## Warning: Removed 7913 rows containing non-finite values (stat_bin).
Figure 14: Histograms of the same data without (left) and with (right) log-transformation
This also applies to density plots.
dfx <- as.data.frame(Biobase::exprs(hiiragi2013))
scp <- ggplot(dfx, aes(x= `59 E4.5 (PE)`,
y = `92 E4.5 (FGF4-KO)`))
scp + geom_point()
Figure 15: Scatter plot comparing the expression of a wild-type vs
 FGF4 KO.
Exercise: The over-plotting of the dots stops us from learning
anything about the density of the different regions of the plot. Use
the alpha parameter to geom_points between 0 (full transparency)
to 1 (opaque, default).
scp + geom_density2d(h = 0.5, bins = 60)
Figure 16: Focusing on contours rather that individual values
scp + geom_hex()
Figure 17: Local density summaries
When visualising data along additional dimension, we can parameterise
the points by setting their shape, colour, size and transparency, that
can be set with point aesthetics such as fill, color (or
colour), shape, size and alpha.
A very powerful way to represent data along additional dimensions is facetting, i.e. producing sub-plots for different subsets of the data. Below, we first re-annotate the data using some regular expressions
p1 <- ggplot(dftx, aes(x = X1426642_at, y = X1418765_at, colour = lineage)) +
geom_point()
p2 <- ggplot(dftx, aes(x = X1426642_at, y = X1418765_at)) +
geom_point() +
facet_grid( . ~ lineage )
p1 + p2
Figure 18: Different sub-plots for different lineages using colours (left) of facets(right) to distinguish the different lineages
ggplot(dftx,
aes(x = X1426642_at, y = X1418765_at)) +
geom_point() +
facet_grid( Embryonic.day ~ lineage )
Figure 19: Different sub-plots for different lineages and embryonic stages
Exercise: Use facets to visualise the distribution of the four
Fgf4, Gata4, Gata6 and Sox2 genes in the genes data using
histograms.
library("plotly")
scp <- ggplot(dfx[1:100, ],
aes(x= `59 E4.5 (PE)`, y = `92 E4.5 (FGF4-KO)`))
scp2 <- scp + geom_point()
ggplotly(scp2)
Figure 20: Interactive visualisation with plotly
See https://plot.ly/r for examples of interactive graphics online.
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=French_Belgium.1252 LC_CTYPE=French_Belgium.1252
## [3] LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C
## [5] LC_TIME=French_Belgium.1252
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_0.0.1 ggbeeswarm_0.6.0 tidyr_0.8.3
## [4] plotly_4.9.0 Hmisc_4.2-0 ggplot2_3.2.0
## [7] Formula_1.2-3 survival_2.44-1.1 lattice_0.20-38
## [10] dplyr_0.8.2 Biobase_2.44.0 BiocGenerics_0.30.0
## [13] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 assertthat_0.2.1 digest_0.6.19
## [4] mime_0.7 plyr_1.8.4 R6_2.4.0
## [7] backports_1.1.4 acepack_1.4.1 evaluate_0.14
## [10] httr_1.4.0 highr_0.8 pillar_1.4.2
## [13] rlang_0.4.0 lazyeval_0.2.2 rstudioapi_0.10
## [16] data.table_1.12.2 hexbin_1.27.3 rpart_4.1-15
## [19] Matrix_1.2-17 checkmate_1.9.3 rmarkdown_1.13
## [22] labeling_0.3 splines_3.6.0 stringr_1.4.0
## [25] foreign_0.8-71 htmlwidgets_1.3 munsell_0.5.0
## [28] shiny_1.3.2 httpuv_1.5.1 compiler_3.6.0
## [31] vipor_0.4.5 xfun_0.8 pkgconfig_2.0.2
## [34] base64enc_0.1-3 htmltools_0.3.6 nnet_7.3-12
## [37] tidyselect_0.2.5 tibble_2.1.3 gridExtra_2.3
## [40] htmlTable_1.13.1 bookdown_0.11 viridisLite_0.3.0
## [43] later_0.8.0 crayon_1.3.4 withr_2.1.2
## [46] MASS_7.3-51.4 grid_3.6.0 xtable_1.8-4
## [49] jsonlite_1.6 gtable_0.3.0 magrittr_1.5
## [52] scales_1.0.0 stringi_1.4.3 reshape2_1.4.3
## [55] promises_1.0.1 latticeExtra_0.6-28 RColorBrewer_1.1-2
## [58] tools_3.6.0 glue_1.3.1 beeswarm_0.2.3
## [61] purrr_0.3.2 crosstalk_1.0.0 yaml_2.2.0
## [64] colorspace_1.4-1 cluster_2.1.0 BiocManager_1.30.4
## [67] knitr_1.23